OmniZoomer: Learning to Move and Zoom in on Sphere at High-Resolution
Omnidirectional images (ODIs) have become increasingly popular, as their large field-of-view (FoV) offers viewers the chance to freely choose the viewing direction in immersive environments such as virtual reality. The Möbius transformation is typically employed to further provide the opportunity for movement and zoom on ODIs, but applying it at the image level often results in blurring and aliasing. In this paper, we propose a novel deep learning-based approach, called OmniZoomer, that incorporates the Möbius transformation into the network for movement and zoom on ODIs. By learning various transformed feature maps under different conditions, the network is better able to handle the increasing edge curvatures, which alleviates the blurring. Moreover, to address the aliasing problem, we propose two key components. First, to compensate for the lack of pixels for describing curves, we enhance the feature maps in the high-resolution (HR) space and calculate the transformed index map with a spatial index generation module. Second, considering that ODIs are inherently represented in spherical space, we propose a spherical resampling module that combines the index map and HR feature maps to transform the feature maps for better spherical correlation. The transformed feature maps are decoded to output a zoomed ODI. Experiments show that our method can produce HR, high-quality ODIs with the flexibility to move and zoom in on the object of interest. Project page is available at http://vlislab22.github.io/OmniZoomer/.
Comment: Accepted by ICCV 2023
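For reference (standard background, not stated in the abstract): a Möbius transformation acts on the extended complex plane, and hence on the sphere via stereographic projection, as

    f(z) = \frac{az + b}{cz + d}, \qquad a, b, c, d \in \mathbb{C}, \quad ad - bc \neq 0.

Pure zoom corresponds to f(z) = sz with real s > 1, while sphere rotations correspond to the subgroup with d = \bar{a}, c = -\bar{b}, and |a|^2 + |b|^2 = 1; composing the two yields the move-and-zoom operations the paper applies to feature maps.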
T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models
The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated a strong capacity for learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge the model has learned, especially when flexible and accurate structure control is needed. In this paper, we aim to "dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control generation at a finer granularity. Specifically, we propose to learn simple and small T2I-Adapters that align internal knowledge in T2I models with external control signals, while freezing the original large T2I models. In this way, we can train various adapters for different conditions and achieve rich control and editing effects. Furthermore, the proposed T2I-Adapters have attractive properties of practical value, such as composability and generalization ability. Extensive experiments demonstrate that our T2I-Adapter achieves promising generation quality across a wide range of applications.
Comment: Tech Report. GitHub: https://github.com/TencentARC/T2I-Adapter
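As a rough illustration of the adapter idea (a minimal PyTorch sketch under assumed shapes and names, not the paper's actual architecture or API): a small trainable network maps a spatial control signal such as an edge map to multi-scale features, which are added to the frozen T2I backbone's intermediate features during denoising.

    import torch
    import torch.nn as nn

    class TinyAdapter(nn.Module):
        """Trainable adapter: control image -> one residual feature map per scale."""
        def __init__(self, in_channels=3, widths=(64, 128, 256)):
            super().__init__()
            blocks, prev = [], in_channels
            for w in widths:
                blocks.append(nn.Sequential(
                    nn.Conv2d(prev, w, 3, stride=2, padding=1),
                    nn.SiLU(),
                    nn.Conv2d(w, w, 3, padding=1),
                ))
                prev = w
            self.blocks = nn.ModuleList(blocks)

        def forward(self, control):
            feats, x = [], control
            for block in self.blocks:
                x = block(x)
                feats.append(x)  # to be added to the frozen backbone at this scale
            return feats

Only the adapter's parameters receive gradients; the large T2I model stays frozen, which is what makes training such small adapters cheap and composable (features from multiple adapters can simply be summed).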
Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models
Recent CLIP-guided 3D optimization methods, such as DreamFields and PureCLIPNeRF, have achieved impressive results in zero-shot text-to-3D synthesis. However, because they are trained from scratch with random initialization and no prior knowledge, these methods often fail to generate accurate and faithful 3D structures that conform to the input text. In this paper, we make the first attempt to introduce explicit 3D shape priors into the CLIP-guided 3D optimization process. Specifically, we first generate a high-quality 3D shape from the input text in a text-to-shape stage to serve as a 3D shape prior. We then use it as the initialization of a neural radiance field and optimize it with the full prompt. To address the challenging text-to-shape generation task, we present a simple yet effective approach that directly bridges the text and image modalities with a powerful text-to-image diffusion model. To narrow the style domain gap between the images synthesized by the text-to-image diffusion model and the shape renderings used to train the image-to-shape generator, we further propose to jointly optimize a learnable text prompt and fine-tune the text-to-image diffusion model for rendering-style image generation. Our method, Dream3D, is capable of generating imaginative 3D content with superior visual quality and shape accuracy compared to state-of-the-art methods.
Comment: Accepted by CVPR 2023. Project page: https://bluestyle97.github.io/dream3d
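The abstract says the generated shape initializes a neural radiance field; one plausible way to realize that for a voxel-grid field is sketched below (an illustration only: the grid resolution, raw-density values, and function names are assumptions, not the paper's).

    import torch

    def density_from_occupancy(occ, solid=10.0, empty=-10.0):
        """Turn a 0/1 occupancy grid into raw (pre-activation) densities.

        With a softplus/sigmoid density activation, large positive raw values
        render as solid and large negative ones as empty, so copying the
        text-to-shape output into the grid gives optimization a meaningful
        starting geometry instead of a random one.
        """
        return torch.where(occ > 0.5,
                           torch.full_like(occ, solid),
                           torch.full_like(occ, empty))

    occ = (torch.rand(128, 128, 128) > 0.7).float()  # stand-in for the voxelized shape prior
    raw_density = density_from_occupancy(occ)

CLIP-guided optimization with the full prompt then refines appearance and geometry from this initialization rather than from scratch.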
One for All, All for One: Learning and Transferring User Embeddings for Cross-Domain Recommendation
Cross-domain recommendation is an important method for improving recommender system performance, especially when observations in target domains are sparse. However, most existing techniques focus on single-target or dual-target cross-domain recommendation (CDR) and are hard to generalize to CDR with multiple target domains. In addition, the negative transfer problem is prevalent in CDR: recommendation performance in a target domain may not always be enhanced by knowledge learned from a source domain, especially when the source domain has sparse data. In this study, we propose CAT-ART, a multi-target CDR method that learns to improve recommendations in all participating domains through representation learning and embedding transfer. Our method consists of two parts: a self-supervised Contrastive AuToencoder (CAT) framework that generates global user embeddings based on information from all participating domains, and an Attention-based Representation Transfer (ART) framework that transfers domain-specific user embeddings from other domains to assist with target-domain recommendation. CAT-ART boosts recommendation performance in any target domain through the combined use of the learned global user representation and knowledge transferred from other domains, in addition to the original user embedding in the target domain. We conducted extensive experiments on a collected real-world CDR dataset spanning 5 domains and involving a million users. Experimental results demonstrate the superiority of the proposed method over a range of prior methods. We further conducted ablation studies to verify the effectiveness of the proposed components. Our collected dataset will be open-sourced to facilitate future research in the field of multi-domain recommender systems and user modeling.
Comment: 9 pages, accepted by WSDM 2023
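A minimal sketch of the ART idea under assumed names and shapes (illustrative, not the paper's implementation): the target-domain user embedding attends over the user's embeddings from the other domains, and the attention-weighted mixture is used alongside the global and target-domain embeddings.

    import torch
    import torch.nn as nn

    class AttentionTransfer(nn.Module):
        """Attention-weighted combination of other-domain user embeddings."""
        def __init__(self, d):
            super().__init__()
            self.proj_q = nn.Linear(d, d)
            self.proj_k = nn.Linear(d, d)
            self.scale = d ** 0.5

        def forward(self, query, sources):
            # query: (B, d) target-domain embedding
            # sources: (B, n_domains, d) embeddings transferred from other domains
            q = self.proj_q(query).unsqueeze(1)                     # (B, 1, d)
            k = self.proj_k(sources)                                # (B, n, d)
            attn = torch.softmax((q * k).sum(-1) / self.scale, -1)  # (B, n)
            return (attn.unsqueeze(-1) * sources).sum(1)            # (B, d)

Because the weights are learned per target domain, a source domain whose embedding would hurt the target (negative transfer) can be attenuated rather than blindly averaged in.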
VMesh: Hybrid Volume-Mesh Representation for Efficient View Synthesis
With the emergence of neural radiance fields (NeRFs), view synthesis quality has reached an unprecedented level. Compared to traditional mesh-based assets, this volumetric representation is more powerful at expressing scene geometry but inevitably suffers from high rendering costs and is hard to use in downstream processes such as editing, which makes it difficult to integrate with the existing graphics pipeline. In this paper, we present a hybrid volume-mesh representation, VMesh, which depicts an object with a textured mesh along with an auxiliary sparse volume. VMesh retains the advantages of mesh-based assets, such as efficient rendering, compact storage, and easy editing, while also incorporating the ability to represent subtle geometric structures provided by the volumetric counterpart. VMesh can be obtained from multi-view images of an object and renders at 2K resolution and 60 FPS on common consumer devices with high fidelity, unleashing new opportunities for real-time immersive applications.
Comment: Project page: https://bennyguo.github.io/vmesh
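As a toy illustration of how such a hybrid might be composited (a single-ray sketch under assumed conventions; the abstract does not specify the renderer): rasterize the mesh to get an opaque background color and depth, keep only volume samples in front of the surface, and alpha-composite them over the mesh.

    import torch

    def composite(mesh_rgb, mesh_depth, vol_rgb, vol_alpha, vol_depth):
        """Alpha-composite front-to-back volume samples over a rasterized mesh.

        mesh_rgb: (3,) surface color; mesh_depth: scalar surface depth.
        vol_rgb: (N, 3), vol_alpha: (N,), vol_depth: (N,), sorted front-to-back.
        """
        alpha = vol_alpha * (vol_depth < mesh_depth)  # drop samples behind the surface
        trans = torch.cumprod(torch.cat([torch.ones(1), 1 - alpha[:-1]]), 0)
        vol_color = ((trans * alpha).unsqueeze(-1) * vol_rgb).sum(0)
        return vol_color + trans[-1] * (1 - alpha[-1]) * mesh_rgb

Keeping the bulk of the object on the cheap, rasterized mesh and reserving volume samples for subtle structures is what makes a 2K/60 FPS budget plausible on consumer hardware.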
HOSNeRF: Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video
We introduce HOSNeRF, a novel 360° free-viewpoint rendering method that reconstructs neural radiance fields for a dynamic human-object scene from a single monocular in-the-wild video. Our method enables pausing the video at any frame and rendering all scene details (dynamic humans, objects, and backgrounds) from arbitrary viewpoints. The first challenge in this task is the complex object motion in human-object interactions, which we tackle by introducing new object bones into the conventional human skeleton hierarchy to effectively estimate large object deformations in our dynamic human-object model. The second challenge is that humans interact with different objects at different times, for which we introduce two new learnable object state embeddings that serve as conditions for learning our human-object representation and scene representation, respectively. Extensive experiments show that HOSNeRF significantly outperforms SOTA approaches on two challenging datasets by a large margin of 40%-50% in terms of LPIPS. The code, data, and compelling examples of 360° free-viewpoint renderings from single videos will be released at https://showlab.github.io/HOSNeRF.
Comment: Project page: https://showlab.github.io/HOSNeRF
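A minimal sketch of the state-embedding conditioning (names and dimensions are assumptions for illustration, not the paper's code): each learnable state vector selects how the field behaves at frames where the human interacts with a given object.

    import torch
    import torch.nn as nn

    class StateConditionedField(nn.Module):
        """Radiance-field MLP conditioned on a learnable object-state embedding."""
        def __init__(self, n_states=4, d_state=16, d_in=3, d_hidden=64):
            super().__init__()
            self.states = nn.Embedding(n_states, d_state)
            self.mlp = nn.Sequential(
                nn.Linear(d_in + d_state, d_hidden), nn.ReLU(),
                nn.Linear(d_hidden, 4),  # RGB + density
            )

        def forward(self, xyz, state_id):
            # xyz: (N, 3) sample positions; state_id: scalar LongTensor per frame
            s = self.states(state_id).expand(xyz.shape[0], -1)
            return self.mlp(torch.cat([xyz, s], dim=-1))

Switching state_id across frames lets a single field represent a human who picks up different objects at different times, which is the second challenge the abstract describes.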
Tenrec: A Large-scale Multipurpose Benchmark Dataset for Recommender Systems
Existing benchmark datasets for recommender systems (RS) are either created at a small scale or involve very limited forms of user feedback. RS models evaluated on such datasets often lack practical value for large-scale real-world applications. In this paper, we describe Tenrec, a novel and publicly available data collection for RS that records various user feedback from four different recommendation scenarios. Specifically, Tenrec has the following five characteristics: (1) it is large-scale, containing around 5 million users and 140 million interactions; (2) it has not only positive user feedback but also true negative feedback (vs. one-class recommendation); (3) it contains overlapping users and items across four different scenarios; (4) it contains various types of positive user feedback, in the form of clicks, likes, shares, and follows; (5) it contains additional features beyond user IDs and item IDs. We verify Tenrec on ten diverse recommendation tasks by running several classical baseline models per task. Tenrec has the potential to become a useful benchmark dataset for a majority of popular recommendation tasks.
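An illustrative loading sketch for this kind of multi-feedback data (the file name and column names here are assumptions, not Tenrec's actual schema): with explicit impression logs, "true negatives" are rows where an item was shown but drew no action, as opposed to the unobserved non-interactions used in one-class recommendation.

    import pandas as pd

    df = pd.read_csv("interactions.csv")  # assumed columns: user_id, item_id, click, like, share, follow

    actions = df[["click", "like", "share", "follow"]].astype(bool)
    positives = df[actions.any(axis=1)]        # at least one recorded action
    true_negatives = df[~actions.any(axis=1)]  # shown, but no action taken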
Using Service Grammar to Diagnose BGP Configuration Errors
Often network components work correctly, yet end-to-end services don't. This happens when configuration parameters of components are set to incorrect values. Configuration is a fundamental operation for logically integrating components to set up end-to-end services. Configuration errors arise frequently because transforming end-to-end service requirements into component configurations is inherently difficult. This transformation is largely performed in a manual and localized fashion, resulting in a high cost of network operations. The Service Grammar technique has been developed to solve the configuration error diagnosis problem and, more generally, to formalize the process of building complex systems via configuration. At its core is a Requirements Language that contains global, high-level constraints on configuration parameters. These constraints are derived by identifying the notion of "correct configuration" associated with different protocols, and are composed to create system-wide requirements on architecture and policies. A Diagnosis Engine checks whether constraints in the Requirements Language hold given concrete component configurations, and is used recursively to check composite requirements. This paper describes an application of Service Grammar to diagnosing BGP configuration errors. Because BGP architecture and policies differ widely from one network to another, previous techniques cannot check whether router configurations implement the intended requirements. Our tools enable administrators to specify system-wide, network-specific requirements and check whether they are correctly implemented by the component configurations.
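A toy sketch of the idea, with an invented data model (this is not the paper's Requirements Language): a global constraint such as "all routers in the AS form an iBGP full mesh" is written once against the intended service, and the Diagnosis Engine reduces it to checks over concrete per-router configurations.

    def ibgp_full_mesh(routers):
        """Constraint: every pair of routers peers with each other over iBGP."""
        violations = []
        for a in routers:
            for b in routers:
                if a["name"] < b["name"] and (
                    b["ip"] not in a["ibgp_peers"] or a["ip"] not in b["ibgp_peers"]
                ):
                    violations.append((a["name"], b["name"]))
        return violations

    routers = [
        {"name": "r1", "ip": "10.0.0.1", "ibgp_peers": {"10.0.0.2"}},
        {"name": "r2", "ip": "10.0.0.2", "ibgp_peers": {"10.0.0.1"}},
        {"name": "r3", "ip": "10.0.0.3", "ibgp_peers": set()},
    ]
    print(ibgp_full_mesh(routers))  # [('r1', 'r3'), ('r2', 'r3')]

Because the constraint is stated against the intended service rather than any one vendor's configuration syntax, the same requirement can be checked across heterogeneous router configurations.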